{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping Tabular Data with Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Medium Article Link: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df_list = pd.read_html(url)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pd.read_html returns list\n", "type(df_list)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Total number of tables found\n", "len(df_list)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
No.NameNet worth (USD)AgeNationalitySource(s) of wealth
01Jeff Bezos$113 billion56United StatesAmazon
12Bill Gates$98 billion64United StatesMicrosoft
23Bernard Arnault and family$76 billion71FranceLVMH
34Warren Buffett$67.5 billion89United StatesBerkshire Hathaway
45Larry Ellison$59 billion75United StatesOracle Corporation
56Amancio Ortega$55.1 billion84SpainInditex, Zara
67Mark Zuckerberg$54.7 billion35United StatesFacebook
78Jim Walton$54.6 billion71United StatesWalmart
89Alice Walton$54.4 billion70United StatesWalmart
910S. Robson Walton$54.1 billion77United StatesWalmart
\n", "
" ], "text/plain": [ " No. Name Net worth (USD) Age Nationality \\\n", "0 1 Jeff Bezos $113 billion 56 United States \n", "1 2 Bill Gates $98 billion 64 United States \n", "2 3 Bernard Arnault and family $76 billion 71 France \n", "3 4 Warren Buffett $67.5 billion 89 United States \n", "4 5 Larry Ellison $59 billion 75 United States \n", "5 6 Amancio Ortega $55.1 billion 84 Spain \n", "6 7 Mark Zuckerberg $54.7 billion 35 United States \n", "7 8 Jim Walton $54.6 billion 71 United States \n", "8 9 Alice Walton $54.4 billion 70 United States \n", "9 10 S. Robson Walton $54.1 billion 77 United States \n", "\n", " Source(s) of wealth \n", "0 Amazon \n", "1 Microsoft \n", "2 LVMH \n", "3 Berkshire Hathaway \n", "4 Oracle Corporation \n", "5 Inditex, Zara \n", "6 Facebook \n", "7 Walmart \n", "8 Walmart \n", "9 Walmart " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_list[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set a particular column as index" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
No.Net worth (USD)AgeNationalitySource(s) of wealth
Name
Jeff Bezos1$113 billion56United StatesAmazon
Bill Gates2$98 billion64United StatesMicrosoft
Bernard Arnault and family3$76 billion71FranceLVMH
Warren Buffett4$67.5 billion89United StatesBerkshire Hathaway
Larry Ellison5$59 billion75United StatesOracle Corporation
Amancio Ortega6$55.1 billion84SpainInditex, Zara
Mark Zuckerberg7$54.7 billion35United StatesFacebook
Jim Walton8$54.6 billion71United StatesWalmart
Alice Walton9$54.4 billion70United StatesWalmart
S. Robson Walton10$54.1 billion77United StatesWalmart
\n", "
" ], "text/plain": [ " No. Net worth (USD) Age Nationality \\\n", "Name \n", "Jeff Bezos 1 $113 billion 56 United States \n", "Bill Gates 2 $98 billion 64 United States \n", "Bernard Arnault and family 3 $76 billion 71 France \n", "Warren Buffett 4 $67.5 billion 89 United States \n", "Larry Ellison 5 $59 billion 75 United States \n", "Amancio Ortega 6 $55.1 billion 84 Spain \n", "Mark Zuckerberg 7 $54.7 billion 35 United States \n", "Jim Walton 8 $54.6 billion 71 United States \n", "Alice Walton 9 $54.4 billion 70 United States \n", "S. Robson Walton 10 $54.1 billion 77 United States \n", "\n", " Source(s) of wealth \n", "Name \n", "Jeff Bezos Amazon \n", "Bill Gates Microsoft \n", "Bernard Arnault and family LVMH \n", "Warren Buffett Berkshire Hathaway \n", "Larry Ellison Oracle Corporation \n", "Amancio Ortega Inditex, Zara \n", "Mark Zuckerberg Facebook \n", "Jim Walton Walmart \n", "Alice Walton Walmart \n", "S. Robson Walton Walmart " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_html(url, index_col=1)[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Return tables containing a string or regex" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearNumber of billionairesGroup's combined net worth
020202095$8.0 trillion
120192153$8.7 trillion
220182208$9.1 trillion
320172043$7.7 trillion
420161810$6.5 trillion
\n", "
" ], "text/plain": [ " Year Number of billionaires Group's combined net worth\n", "0 2020 2095 $8.0 trillion\n", "1 2019 2153 $8.7 trillion\n", "2 2018 2208 $9.1 trillion\n", "3 2017 2043 $7.7 trillion\n", "4 2016 1810 $6.5 trillion" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_html(url, match='Number and combined net worth of billionaires by year')[0].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify strings to recognize as NA/NaN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### without specifying `na_values`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
9Net worth (1st)US$113 billion
10Number of billionaires2095
11Total list net worth valueUS$8 trillion
12New members to the list178
13Forbes: The World's Billionaires websiteForbes: The World's Billionaires website
\n", "
" ], "text/plain": [ " 0 \\\n", "9 Net worth (1st) \n", "10 Number of billionaires \n", "11 Total list net worth value \n", "12 New members to the list \n", "13 Forbes: The World's Billionaires website \n", "\n", " 1 \n", "9 US$113 billion \n", "10 2095 \n", "11 US$8 trillion \n", "12 178 \n", "13 Forbes: The World's Billionaires website " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_html(url)[0].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### after specifying `na_values`" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
9Net worth (1st)US$113 billion
10Number of billionaires2095
11Total list net worth valueUS$8 trillion
12New members to the list178
13NaNNaN
\n", "
" ], "text/plain": [ " 0 1\n", "9 Net worth (1st) US$113 billion\n", "10 Number of billionaires 2095\n", "11 Total list net worth value US$8 trillion\n", "12 New members to the list 178\n", "13 NaN NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_html(url,na_values=[\"Forbes: The World's Billionaires website\"])[0].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other parameters" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PublisherWhale Media InvestmentsForbes family
0PublicationForbes
1First publishedMarch 1987[1]
2Latest publicationApril 7, 2020
3Current list details (2020)[2]Current list details (2020)[2]
4WealthiestJeff Bezos
\n", "
" ], "text/plain": [ " Publisher Whale Media InvestmentsForbes family\n", "0 Publication Forbes\n", "1 First published March 1987[1]\n", "2 Latest publication April 7, 2020\n", "3 Current list details (2020)[2] Current list details (2020)[2]\n", "4 Wealthiest Jeff Bezos" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_html(url, skiprows=3, header=0)[0].head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }